instrumental goal
Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture
Reinforcement Learning is a machine learning methodology that has demonstrated strong performance across a variety of tasks. In particular, it plays a central role in the development of artificial autonomous agents. As these agents become increasingly capable, market readiness is rapidly approaching, which means those agents, for example taking the form of humanoid robots or autonomous cars, are poised to transition from laboratory prototypes to autonomous operation in real-world environments. This transition raises concerns leading to specific requirements for these systems - among them, the requirement that they are designed to behave ethically. Crucially, research directed toward building agents that fulfill the requirement to behave ethically - referred to as artificial moral agents(AMAs) - has to address a range of challenges at the intersection of computer science and philosophy. This study explores the development of reason-based artificial moral agents (RBAMAs). RBAMAs are build on an extension of the reinforcement learning architecture to enable moral decision-making based on sound normative reasoning, which is achieved by equipping the agent with the capacity to learn a reason-theory - a theory which enables it to process morally relevant propositions to derive moral obligations - through case-based feedback. They are designed such that they adapt their behavior to ensure conformance to these obligations while they pursue their designated tasks. These features contribute to the moral justifiability of the their actions, their moral robustness, and their moral trustworthiness, which proposes the extended architecture as a concrete and deployable framework for the development of AMAs that fulfills key ethical desiderata. This study presents a first implementation of an RBAMA and demonstrates the potential of RBAMAs in initial experiments.
- Europe (0.92)
- North America > United States (0.92)
- Research Report > New Finding (1.00)
- Overview (0.92)
- Transportation > Ground > Road (0.47)
- Information Technology > Robotics & Automation (0.33)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
Misalignment from Treating Means as Ends
Marklund, Henrik, Infanger, Alex, Van Roy, Benjamin
Reward functions, learned or manually specified, are rarely perfect. Instead of accurately expressing human goals, these reward functions are often distorted by human beliefs about how best to achieve those goals. Specifically, these reward functions often express a combination of the human's terminal goals -- those which are ends in themselves -- and the human's instrumental goals -- those which are means to an end. We formulate a simple example in which even slight conflation of instrumental and terminal goals results in severe misalignment: optimizing the misspecified reward function results in poor performance when measured by the true reward function. This example distills the essential properties of environments that make reinforcement learning highly sensitive to conflation of instrumental and terminal goals. We discuss how this issue can arise with a common approach to reward learning and how it can manifest in real environments.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm
There is growing interest in giving AI agents access to cryptocurrencies as well as to the smart contracts that transact them. But doing so, this position paper argues, could lead to formidable new vectors of AI harm. To support this argument, we first examine the unique properties of cryptocurrencies and smart contracts that could lead to these new vectors of harm. Next, we describe each of these new vectors of harm in detail. Finally, we conclude with a call for more technical research aimed at preventing and mitigating these harms and, thereby making it safer to endow AI agents with cryptocurrencies and smart contracts.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Oceania > New Zealand (0.04)
- (13 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Banking & Finance > Trading (1.00)
- Banking & Finance > Economy (1.00)
Why Do Some Language Models Fake Alignment While Others Don't?
Sheshadri, Abhay, Hughes, John, Michael, Julian, Mallen, Alex, Jose, Arun, Janus, null, Roger, Fabien
Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
- North America > United States (0.14)
- North America > Mexico > Sinaloa (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.92)
Technical Report: Evaluating Goal Drift in Language Model Agents
Arike, Rauno, Donoway, Elizabeth, Bartsch, Henning, Hobbhahn, Marius
As language models (LMs) are increasingly deployed as autonomous agents, their robust adherence to human-assigned objectives becomes crucial for safe operation. When these agents operate independently for extended periods without human oversight, even initially well-specified goals may gradually shift. Detecting and measuring goal drift - an agent's tendency to deviate from its original objective over time - presents significant challenges, as goals can shift gradually, causing only subtle behavioral changes. This paper proposes a novel approach to analyzing goal drift in LM agents. In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures. We demonstrate that while the best-performing agent (a scaffolded version of Claude 3.5 Sonnet) maintains nearly perfect goal adherence for more than 100,000 tokens in our most difficult evaluation setting, all evaluated models exhibit some degree of goal drift. We also find that goal drift correlates with models' increasing susceptibility to pattern-matching behaviors as the context length grows.
- Research Report > New Finding (1.00)
- Overview (1.00)
Evaluating Language Model Character Traits
Ward, Francis Rhys, Yang, Zejia, Jackson, Alex, Brown, Randy, Smith, Chandler, Colverd, Grace, Thomson, Louis, Douglas, Raymond, Bartak, Patrik, Rowan, Andrew
Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts, but may be reflective in different contexts, meaning they mirror the LM's behavior in the preceding interaction. Our formalism enables us to describe LM behaviour precisely in intuitive language, without undue anthropomorphism.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
The Reasons that Agents Act: Intention and Instrumental Goals
Ward, Francis Rhys, MacDermott, Matt, Belardinelli, Francesco, Toni, Francesca, Everitt, Tom
Intention is an important and challenging concept in AI. It is important because it underlies many other concepts we care about, such as agency, manipulation, legal responsibility, and blame. However, ascribing intent to AI systems is contentious, and there is no universally accepted theory of intention applicable to AI agents. We operationalise the intention with which an agent acts, relating to the reasons it chooses its decision. We introduce a formal definition of intention in structural causal influence models, grounded in the philosophy literature on intent and applicable to real-world machine learning systems. Through a number of examples and results, we show that our definition captures the intuitive notion of intent and satisfies desiderata set-out by past work. In addition, we show how our definition relates to past concepts, including actual causality, and the notion of instrumental goals, which is a core idea in the literature on safe AI agents. Finally, we demonstrate how our definition can be used to infer the intentions of reinforcement learning agents and language models from their behaviour.
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- North America > United States > Tennessee > Shelby County > Memphis (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Artificial Persuasion Takes Over the World
Blurb: Narrates a fictional future where persuasive Artificial General Intelligence (AGI) goes rogue. Inspired in part by the AI Vignettes Project. A fondness for irony will help readers. "AI-powered memetic warfare makes all humans effectively insane." You can't trust any content from anyone you don't know. Phone calls, texts, and emails are poisoned. But the current waste and harm from scammers, influencers, propagandists, marketers, and their associated algorithms are nothing compared to what might happen. Coming AIs might be super-persuaders, and they might have their own very harmful agendas. People being routinely unsure of what's reality is one bad outcome, but there are others worse.
User Tampering in Reinforcement Learning Recommender Systems
Evans, Charles, Kasirzadeh, Atoosa
This paper provides the first formalisation and empirical demonstration of a particular safety concern in reinforcement learning (RL)-based news and social media recommendation algorithms. This safety concern is what we call "user tampering" -- a phenomenon whereby an RL-based recommender system may manipulate a media user's opinions, preferences and beliefs via its recommendations as part of a policy to increase long-term user engagement. We provide a simulation study of a media recommendation problem constrained to the recommendation of political content, and demonstrate that a Q-learning algorithm consistently learns to exploit its opportunities to 'polarise' simulated 'users' with its early recommendations in order to have more consistent success with later recommendations catering to that polarisation. Finally, we argue that given our findings, designing an RL-based recommender system which cannot learn to exploit user tampering requires making the metric for the recommender's success independent of observable signals of user engagement, and thus that a media recommendation system built solely with RL is necessarily either unsafe, or almost certainly commercially unviable.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- (15 more...)
- Information Technology (0.67)
- Media (0.66)
- Leisure & Entertainment (0.46)
Superintelligent, Amoral, and Out of Control - Issue 84: Outbreak
In the summer of 1956, a small group of mathematicians and computer scientists gathered at Dartmouth College to embark on the grand project of designing intelligent machines. The ultimate goal, as they saw it, was to build machines rivaling human intelligence. As the decades passed and AI became an established field, it lowered its sights. There were great successes in logic, reasoning, and game-playing, but stubborn progress in areas like vision and fine motor-control. This led many AI researchers to abandon their earlier goals of fully general intelligence, and focus instead on solving specific problems with specialized methods.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Germany > Berlin (0.04)
- Leisure & Entertainment > Games (0.71)
- Law (0.47)